The initial phase of this analysis prioritizes meticulous data preparation and examination, commonly referred to as data preprocessing and exploratory data analysis (EDA). This step is the cornerstone of the entire study, aiming to uncover insights and identify anomalies within the dataset. A combination of unsupervised and supervised machine learning techniques is then employed to extract patterns, categorize tumors, and predict numerical features, making full use of the dataset's richness and supporting actionable conclusions. Throughout, the methodology emphasizes clarity, transparency, and reproducibility, and ethical implications, particularly for healthcare applications of machine learning, are carefully considered.
Pre-processing the dataset and conducting Exploratory Data Analysis (EDA) are vital for understanding data structure and preparing it for analysis. This phase includes handling missing, duplicated, or outlier values to ensure data integrity. Transforming data may be necessary for normalization or scaling, while categorical features are encoded numerically. Splitting the dataset into training and test sets facilitates model evaluation. Feature engineering techniques, like extraction and selection, enhance predictive power. Informative plots and tables visualize distributions, correlations, and trends. Statistical assumptions are assessed to validate analytical approaches. By pre-processing and conducting EDA, analysts gain insights into the dataset, enabling informed decision-making and robust model development.
First, load the dataset from its file path. Missing values can introduce bias or errors during analysis, because many algorithms cannot handle them directly, leading to suboptimal performance. Duplicate rows may indicate errors in data collection and can contribute to overfitting, impeding a model's ability to generalize to new data. Both issues therefore need to be addressed through appropriate data cleaning.
How missing values and duplicates are handled directly influences the quality and reliability of subsequent analyses and machine learning models. Neglecting or mishandling them can produce biased results, diminished model performance, and unreliable insights, so removing them systematically is essential for the validity of any conclusions drawn from the dataset.
import pandas as pd
# Path of the dataset
file_path = 'D:/Karthika University/MS4S16MachineLearning/Assignment/MS4S16_Dataset.csv'
# Load the dataset with pandas (DataFrame name: data_set)
data_set = pd.read_csv(file_path)
# Count the missing values in each column
missing_values = data_set.isnull().sum()
print("Missing Values:\n", missing_values)
# Remove rows with missing values
data_set = data_set.dropna()
# Find the duplicated rows
duplicated_rows = data_set[data_set.duplicated()]
print("Duplicated rows:\n", duplicated_rows)
# Remove duplicated rows from the dataset
dataset_cleaned = data_set.drop_duplicates()
print(f"Number of duplicated rows: {len(duplicated_rows)}")
print(f"Shape of the original DataFrame: {data_set.shape}")
print(f"Shape of the DataFrame after removing duplicates: {dataset_cleaned.shape}")
# The cleaned dataset is named 'data'
data = dataset_cleaned.copy()
print(data)
Missing Values:
id 3
diagnosis 3
radius_mean 5
texture_mean 6
perimeter_mean 4
area_mean 5
smoothness_mean 3
compactness_mean 4
concavity_mean 4
concave points_mean 8
symmetry_mean 3
fractal_dimension_mean 4
radius_se 6
texture_se 8
perimeter_se 3
area_se 6
smoothness_se 6
compactness_se 7
concavity_se 8
concave points_se 9
symmetry_se 8
fractal_dimension_se 7
radius_worst 13
texture_worst 21
perimeter_worst 6
area_worst 4
smoothness_worst 9
compactness_worst 4
concavity_worst 3
concave points_worst 6
symmetry_worst 4
fractal_dimension_worst 13
dtype: int64
Duplicated rows:
id diagnosis radius_mean texture_mean perimeter_mean area_mean \
493 914062.0 M 18.01 -999.00 118.40 1007.0
570 92751.0 B 7.76 24.54 47.92 181.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
493 0.10010 0.12890 0.117 0.07762
570 0.05263 0.04362 0.000 0.00000
... radius_worst texture_worst perimeter_worst area_worst \
493 ... 21.530 26.06 143.40 1426.0
570 ... 9.456 30.37 59.16 268.6
smoothness_worst compactness_worst concavity_worst \
493 0.13090 0.23270 0.2544
570 0.08996 0.06444 0.0000
concave points_worst symmetry_worst fractal_dimension_worst
493 0.1489 0.3251 0.07625
570 0.0000 0.2871 0.07039
[2 rows x 32 columns]
Number of duplicated rows: 2
Shape of the original DataFrame: (482, 32)
Shape of the DataFrame after removing duplicates: (480, 32)
id diagnosis radius_mean texture_mean perimeter_mean \
0 842302.0 M 17.99 10.38 122.80
1 842517.0 M 20.57 17.77 132.90
2 84300903.0 M 19.69 21.25 130.00
3 84348301.0 M 11.42 20.38 77.58
4 84358402.0 M 20.29 14.34 135.10
.. ... ... ... ... ...
564 926125.0 M 20.92 25.09 143.00
565 926424.0 M 21.56 22.39 142.00
567 926954.0 M 16.60 28.08 108.30
568 927241.0 M 20.60 29.33 140.10
569 92751.0 B 7.76 24.54 47.92
area_mean smoothness_mean compactness_mean concavity_mean \
0 1001.0 0.11840 0.27760 0.30010
1 1326.0 0.08474 0.07864 0.08690
2 1203.0 0.10960 0.15990 0.19740
3 386.1 0.14250 0.28390 0.24140
4 1297.0 0.10030 0.13280 0.19800
.. ... ... ... ...
564 1347.0 0.10990 0.22360 0.31740
565 1479.0 0.11100 0.11590 0.24390
567 858.1 0.08455 0.10230 0.09251
568 1265.0 0.11780 0.27700 0.35140
569 181.0 0.05263 0.04362 0.00000
concave points_mean ... radius_worst texture_worst perimeter_worst \
0 0.14710 ... 25.380 17.33 184.60
1 0.07017 ... 24.990 23.41 158.80
2 0.12790 ... 23.570 25.53 152.50
3 0.10520 ... 14.910 26.50 98.87
4 0.10430 ... 22.540 16.67 152.20
.. ... ... ... ... ...
564 0.14740 ... 24.290 29.41 179.10
565 0.13890 ... 25.450 26.40 166.10
567 0.05302 ... 18.980 34.12 126.70
568 0.15200 ... 25.740 39.42 184.60
569 0.00000 ... 9.456 30.37 59.16
area_worst smoothness_worst compactness_worst concavity_worst \
0 2019.0 0.16220 0.66560 0.7119
1 1956.0 0.12380 0.18660 0.2416
2 1709.0 0.14440 0.42450 0.4504
3 567.7 0.20980 0.86630 0.6869
4 1575.0 0.13740 0.20500 0.4000
.. ... ... ... ...
564 1819.0 0.14070 0.41860 0.6599
565 2027.0 0.14100 0.21130 0.4107
567 1124.0 0.11390 0.30940 0.3403
568 1821.0 0.16500 0.86810 0.9387
569 268.6 0.08996 0.06444 0.0000
concave points_worst symmetry_worst fractal_dimension_worst
0 0.2654 0.4601 0.11890
1 0.1860 0.2750 0.08902
2 0.2430 0.3613 0.08758
3 0.2575 0.6638 0.17300
4 0.1625 0.2364 0.07678
.. ... ... ...
564 0.2542 0.2929 0.09873
565 0.2216 0.2060 0.07115
567 0.1418 0.2218 0.07820
568 0.2650 0.4087 0.12400
569 0.0000 0.2871 0.07039
[480 rows x 32 columns]
Not all columns in this dataset display a normal distribution, which can influence statistical assumptions. In such cases, it's recommended to adopt analytical approaches tailored for non-normal distributions. When using machine learning models or conducting statistical analyses, awareness of non-normal distribution is crucial. Consider employing techniques robust to deviations from normality, like non-parametric methods or transformations, to ensure accurate and reliable results.
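As a concrete non-parametric option, the Mann-Whitney U test compares two groups without assuming normality. A minimal sketch on synthetic skewed data (the samples below are illustrative, not drawn from this dataset):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Two right-skewed samples standing in for a feature split by diagnosis group
group_a = rng.lognormal(mean=0.0, sigma=0.5, size=100)
group_b = rng.lognormal(mean=0.5, sigma=0.5, size=100)

# Mann-Whitney U compares distributions via ranks, so normality is not required
stat, p_value = mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"U statistic: {stat:.1f}, p-value: {p_value:.3g}")
```

Rank-based tests such as this remain valid for the heavily skewed columns flagged below, where a t-test's normality assumption would be questionable.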
import pandas as pd
import numpy as np
from scipy.stats import shapiro
import matplotlib.pyplot as plt

# Assuming the cleaned data is loaded into a DataFrame named 'original_data'
original_data = data

# Function to test normality and plot a histogram for one column
def test_normality_and_plot(df_column):
    # Convert the column to a numeric type, coercing non-numeric values to NaN
    cleaned_data = pd.to_numeric(df_column, errors='coerce').dropna()
    # The Shapiro-Wilk test needs at least three non-missing values
    if len(cleaned_data) < 3:
        print(f"{df_column.name}: Insufficient data for normality test (less than 3 values).\n")
        return
    # Perform the Shapiro-Wilk test
    stat, p_value = shapiro(cleaned_data)
    # Plot histogram
    plt.figure(figsize=(8, 6))
    plt.hist(cleaned_data, bins='auto', color='blue', edgecolor='black')
    plt.title(f'Histogram of {df_column.name}')
    plt.xlabel('Values')
    plt.ylabel('Frequency')
    plt.show()
    # Print test results
    print(f"{df_column.name}:")
    print(f"Shapiro-Wilk Test - p-value: {p_value}")
    if p_value > 0.05:
        print("Data appears to be normally distributed.\n")
    else:
        print("Data does not appear to be normally distributed.\n")

# Apply the test to every column
for column in original_data.columns:
    test_normality_and_plot(original_data[column])
id: Shapiro-Wilk Test - p-value: 2.04256066756914e-40 Data does not appear to be normally distributed.
diagnosis: Insufficient data for normality test (less than 3 values).
radius_mean: Shapiro-Wilk Test - p-value: 9.995522747170693e-14 Data does not appear to be normally distributed.
texture_mean: Shapiro-Wilk Test - p-value: 3.733706543469508e-33 Data does not appear to be normally distributed.
perimeter_mean: Shapiro-Wilk Test - p-value: 2.5340305212246533e-14 Data does not appear to be normally distributed.
area_mean: Shapiro-Wilk Test - p-value: 4.219632481566258e-21 Data does not appear to be normally distributed.
smoothness_mean: Shapiro-Wilk Test - p-value: 0.021688221022486687 Data does not appear to be normally distributed.
compactness_mean: Shapiro-Wilk Test - p-value: 6.089557027036396e-16 Data does not appear to be normally distributed.
concavity_mean: Shapiro-Wilk Test - p-value: 7.982317468061152e-20 Data does not appear to be normally distributed.
concave points_mean: Shapiro-Wilk Test - p-value: 9.80908925027372e-44 Data does not appear to be normally distributed.
symmetry_mean: Shapiro-Wilk Test - p-value: 1.984799144869671e-41 Data does not appear to be normally distributed.
fractal_dimension_mean: Shapiro-Wilk Test - p-value: 1.2990036764291054e-42 Data does not appear to be normally distributed.
radius_se: Shapiro-Wilk Test - p-value: 5.4718801282130944e-27 Data does not appear to be normally distributed.
texture_se: Shapiro-Wilk Test - p-value: 1.1859550658426937e-17 Data does not appear to be normally distributed.
perimeter_se: Shapiro-Wilk Test - p-value: 3.284580112550128e-28 Data does not appear to be normally distributed.
area_se: Shapiro-Wilk Test - p-value: 2.9167252963495484e-33 Data does not appear to be normally distributed.
smoothness_se: Shapiro-Wilk Test - p-value: 2.468378084179224e-22 Data does not appear to be normally distributed.
compactness_se: Shapiro-Wilk Test - p-value: 2.3120763398623534e-22 Data does not appear to be normally distributed.
concavity_se: Shapiro-Wilk Test - p-value: 4.706028250861632e-30 Data does not appear to be normally distributed.
concave points_se: Shapiro-Wilk Test - p-value: 8.647694995324502e-16 Data does not appear to be normally distributed.
symmetry_se: Shapiro-Wilk Test - p-value: 9.280427270639034e-23 Data does not appear to be normally distributed.
fractal_dimension_se: Shapiro-Wilk Test - p-value: 1.6255062186167878e-43 Data does not appear to be normally distributed.
radius_worst: Shapiro-Wilk Test - p-value: 1.0698616399906931e-16 Data does not appear to be normally distributed.
texture_worst: Shapiro-Wilk Test - p-value: 1.3853089512849692e-05 Data does not appear to be normally distributed.
perimeter_worst: Shapiro-Wilk Test - p-value: 1.8719321649731496e-34 Data does not appear to be normally distributed.
area_worst: Shapiro-Wilk Test - p-value: 1.2546626015983025e-23 Data does not appear to be normally distributed.
smoothness_worst: Shapiro-Wilk Test - p-value: 0.0018222584621980786 Data does not appear to be normally distributed.
compactness_worst: Shapiro-Wilk Test - p-value: 1.536926724414147e-17 Data does not appear to be normally distributed.
concavity_worst: Shapiro-Wilk Test - p-value: 8.437772914182337e-15 Data does not appear to be normally distributed.
concave points_worst: Shapiro-Wilk Test - p-value: 4.832046762714981e-09 Data does not appear to be normally distributed.
symmetry_worst: Shapiro-Wilk Test - p-value: 2.710039033697179e-16 Data does not appear to be normally distributed.
fractal_dimension_worst: Shapiro-Wilk Test - p-value: 3.0740765890703364e-16 Data does not appear to be normally distributed.
When dealing with a dataset that is not normally distributed, addressing outliers becomes crucial for maintaining the integrity of statistical analyses and machine learning models. Outliers, or extreme values, can significantly influence summary statistics and model performance. One common method for identifying and handling outliers is the Interquartile Range (IQR) method.
In cases where the dataset includes categorical variables, like the 'diagnosis' column in this scenario, directly applying the IQR method might not be feasible. To overcome this limitation, encoding the categorical data is necessary, converting it into a numerical format that allows for the application of outlier detection techniques.
Acknowledging and managing outliers is essential for robust data analysis. The choice of method, such as the IQR rule, depends on the distribution of the data and the nature of the variables involved. Encoding categorical variables first makes outlier detection applicable across the whole dataset, supporting a more reliable and accurate analysis.
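The IQR rule can be illustrated on a toy series before applying it to the full dataset (the values are purely illustrative):

```python
import pandas as pd
from scipy.stats import iqr

toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])  # 100 is an obvious outlier
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr_value = iqr(toy)  # equivalent to q3 - q1
lower_bound = q1 - 1.5 * iqr_value
upper_bound = q3 + 1.5 * iqr_value
outliers = toy[(toy < lower_bound) | (toy > upper_bound)]
print(lower_bound, upper_bound)  # -2.5 11.5
print(outliers.tolist())         # [100]
```

Any value outside the fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged; here only 100 falls outside.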
# Encode the categorical data (copy so the original 'data' is not modified)
encoded_data = data.copy()
# 'diagnosis' is the categorical column
diagnosis_encoded = pd.get_dummies(encoded_data['diagnosis'], prefix='diagnosis', drop_first=True)
# Drop the original 'diagnosis' column and concatenate the one-hot encoded column
encoded_data = pd.concat([encoded_data.drop('diagnosis', axis=1), diagnosis_encoded], axis=1)
# Display the encoded dataset
print(encoded_data)
id radius_mean texture_mean perimeter_mean area_mean \
0 842302.0 17.99 10.38 122.80 1001.0
1 842517.0 20.57 17.77 132.90 1326.0
2 84300903.0 19.69 21.25 130.00 1203.0
3 84348301.0 11.42 20.38 77.58 386.1
4 84358402.0 20.29 14.34 135.10 1297.0
.. ... ... ... ... ...
564 926125.0 20.92 25.09 143.00 1347.0
565 926424.0 21.56 22.39 142.00 1479.0
567 926954.0 16.60 28.08 108.30 858.1
568 927241.0 20.60 29.33 140.10 1265.0
569 92751.0 7.76 24.54 47.92 181.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.30010 0.14710
1 0.08474 0.07864 0.08690 0.07017
2 0.10960 0.15990 0.19740 0.12790
3 0.14250 0.28390 0.24140 0.10520
4 0.10030 0.13280 0.19800 0.10430
.. ... ... ... ...
564 0.10990 0.22360 0.31740 0.14740
565 0.11100 0.11590 0.24390 0.13890
567 0.08455 0.10230 0.09251 0.05302
568 0.11780 0.27700 0.35140 0.15200
569 0.05263 0.04362 0.00000 0.00000
symmetry_mean ... texture_worst perimeter_worst area_worst \
0 0.2419 ... 17.33 184.60 2019.0
1 0.1812 ... 23.41 158.80 1956.0
2 0.2069 ... 25.53 152.50 1709.0
3 0.2597 ... 26.50 98.87 567.7
4 0.1809 ... 16.67 152.20 1575.0
.. ... ... ... ... ...
564 2.1000 ... 29.41 179.10 1819.0
565 0.1726 ... 26.40 166.10 2027.0
567 0.1590 ... 34.12 126.70 1124.0
568 0.2397 ... 39.42 184.60 1821.0
569 0.1587 ... 30.37 59.16 268.6
smoothness_worst compactness_worst concavity_worst \
0 0.16220 0.66560 0.7119
1 0.12380 0.18660 0.2416
2 0.14440 0.42450 0.4504
3 0.20980 0.86630 0.6869
4 0.13740 0.20500 0.4000
.. ... ... ...
564 0.14070 0.41860 0.6599
565 0.14100 0.21130 0.4107
567 0.11390 0.30940 0.3403
568 0.16500 0.86810 0.9387
569 0.08996 0.06444 0.0000
concave points_worst symmetry_worst fractal_dimension_worst \
0 0.2654 0.4601 0.11890
1 0.1860 0.2750 0.08902
2 0.2430 0.3613 0.08758
3 0.2575 0.6638 0.17300
4 0.1625 0.2364 0.07678
.. ... ... ...
564 0.2542 0.2929 0.09873
565 0.2216 0.2060 0.07115
567 0.1418 0.2218 0.07820
568 0.2650 0.4087 0.12400
569 0.0000 0.2871 0.07039
diagnosis_M
0 True
1 True
2 True
3 True
4 True
.. ...
564 True
565 True
567 True
568 True
569 False
[480 rows x 32 columns]
import numpy as np
from scipy.stats import iqr

original_data = encoded_data.copy()

# Function to detect and handle outliers in one column using the IQR method
def handle_outliers_using_iqr(df_column):
    # Only process numeric columns
    if pd.api.types.is_numeric_dtype(df_column):
        # Convert the column to a numeric type, coercing non-numeric values to NaN
        cleaned_data = pd.to_numeric(df_column, errors='coerce')
        # Drop NaN values
        cleaned_data = cleaned_data.dropna()
        if not cleaned_data.empty:
            # Calculate the first and third quartiles
            q1 = cleaned_data.quantile(0.25)
            q3 = cleaned_data.quantile(0.75)
            # Calculate the IQR (Interquartile Range)
            iqr_value = iqr(cleaned_data)
            # Define the lower and upper bounds for outliers
            lower_bound = q1 - 1.5 * iqr_value
            upper_bound = q3 + 1.5 * iqr_value
            # Identify outliers and set them to NaN
            outliers = (cleaned_data < lower_bound) | (cleaned_data > upper_bound)
            cleaned_data[outliers] = np.nan
    else:
        cleaned_data = df_column
    return cleaned_data

# Apply the function column-wise (apply, not applymap, so each call receives a Series)
cleaned_data = original_data.apply(handle_outliers_using_iqr)
# Display the cleaned dataset
print(cleaned_data)
[480 rows x 32 columns]
Splitting the dataset into training and test sets is essential for accurately assessing a model's performance, preventing overfitting, tuning hyperparameters, avoiding data leakage, and selecting the best model for a given task. In this standard practice, typically, the training set comprises 80% of the data, while the test set comprises the remaining 20%. This division allows for robust evaluation of the model's generalization ability while ensuring adequate training data for model learning.
from sklearn.model_selection import train_test_split
# Split the dataset
X = cleaned_data.drop(['id', 'diagnosis_M'], axis=1)
y = cleaned_data['diagnosis_M']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The StandardScaler offers several advantages, including improved convergence, equal contribution of features measured on different scales, stabilized variance, simpler interpretation of model coefficients, and better behavior of regularization and distance-based algorithms. Note that it is sensitive to extreme values, which is one reason outliers were addressed beforehand; RobustScaler is an alternative when extreme values remain.
from sklearn.preprocessing import StandardScaler
# Create a StandardScaler instance
scaler = StandardScaler()
# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)
# Transform the test data with the scaler fitted on the training data only
X_test_scaled = scaler.transform(X_test)
The dataset is not normally distributed, and after standardization it contains zero and negative values, which limits the options for data transformation. The log transform and the Box-Cox transform are unsuitable here because they require strictly positive inputs. The square root transform is widely used to reduce right skew and stabilize variance, which benefits regression analysis in particular, but it is only defined for non-negative values: applied to standardized data it yields NaN for every negative entry, as the runtime warning below shows. A transform defined on the whole real line, such as the Yeo-Johnson power transform, is a safer choice in this situation.
import numpy as np
# Apply the square root to the scaled data; negative entries produce NaN
# (see the RuntimeWarning below)
X_train_sqrt_transformed = np.sqrt(X_train_scaled)
X_test_sqrt_transformed = np.sqrt(X_test_scaled)
C:\Users\User\AppData\Local\Temp\ipykernel_10812\3660086394.py:3: RuntimeWarning: invalid value encountered in sqrt
  X_train_sqrt_transformed = np.sqrt(X_train_scaled)
C:\Users\User\AppData\Local\Temp\ipykernel_10812\3660086394.py:4: RuntimeWarning: invalid value encountered in sqrt
  X_test_sqrt_transformed = np.sqrt(X_test_scaled)
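Since the standardized features contain negative values, the square root inevitably produces NaNs, which is what the RuntimeWarning above reports. A transform defined for zero and negative inputs, such as scikit-learn's Yeo-Johnson PowerTransformer, sidesteps the problem; a minimal sketch on illustrative data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Standardized-style data: roughly half the entries are negative (illustrative)
X = rng.normal(size=(100, 3))

with np.errstate(invalid='ignore'):
    print(np.isnan(np.sqrt(X)).any())  # True: sqrt leaves NaNs for negatives

# Yeo-Johnson is defined on the whole real line, unlike log, Box-Cox, or sqrt
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_transformed = pt.fit_transform(X)
print(np.isnan(X_transformed).any())   # False: no NaNs introduced
```

As with the scaler, the transformer should be fitted on the training split and applied to the test split.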
The dataset focuses on cancer diagnosis, where "diagnosis" identifies tumors as malignant (M) or benign (B). A distribution graph displays the frequency of each diagnosis type, with benign cases outnumbering malignant ones. This imbalance poses challenges, potentially skewing analysis and model predictions. Reasons for the disparity may include tumor occurrence rates or data biases. To ensure accurate insights, addressing this imbalance through techniques like resampling or alternative modeling approaches is crucial. Understanding and mitigating class imbalance are essential for reliable cancer diagnosis insights, shaping subsequent analysis steps.
import seaborn as sns
import matplotlib.pyplot as plt
# Count plot of the diagnosis classes
plt.figure(figsize=(6, 4))
sns.countplot(x='diagnosis_M', data=cleaned_data)
plt.title('Distribution of Diagnosis (Malignant vs. Benign)')
plt.show()
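One straightforward way to counter the imbalance described above is to upsample the minority class with scikit-learn's resample utility; a minimal sketch on a toy frame (the column name diagnosis_M matches the encoded dataset, but the data here is illustrative):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame standing in for the cleaned dataset
df = pd.DataFrame({'feature': range(10),
                   'diagnosis_M': [True] * 3 + [False] * 7})

minority = df[df['diagnosis_M']]
majority = df[~df['diagnosis_M']]

# Sample the minority class with replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced['diagnosis_M'].value_counts())  # 7 of each class
```

Resampling should be applied to the training split only, so the test set retains the original class proportions.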
This figure depicts the correlation matrix of the dataset, showcasing the relationships between different columns. High correlation values between columns indicate strong linear relationships, suggesting that changes in one variable coincide with changes in another. The presence of highly correlated columns implies redundancy or multicollinearity within the dataset. In other words, some features may convey similar information, potentially affecting the performance of machine learning models by introducing noise or instability. Addressing multicollinearity through feature selection or dimensionality reduction techniques can enhance model interpretability and predictive accuracy.
# Correlation matrix
correlation_matrix = cleaned_data.corr()
# Heatmap visualization of the correlation matrix
plt.figure(figsize=(14, 12))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
# Pair plot for selected numerical variables
sns.pairplot(cleaned_data[['radius_mean', 'texture_mean', 'perimeter_mean', 'diagnosis_M']], hue='diagnosis_M')
plt.suptitle('Pair Plot', y=1.02)
plt.show()
sns.pairplot(cleaned_data, hue='diagnosis_M', markers=['o', 's'])
plt.suptitle('Pairwise Scatter Plots with Target Variable', y=1.02)
plt.show()
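A simple remedy for the multicollinearity visible in the heatmap is to drop one member of each highly correlated pair; a minimal sketch using an absolute-correlation threshold of 0.9 (toy data; the threshold itself is a judgment call):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    'a': a,
    'a_copy': a + rng.normal(scale=0.01, size=200),  # near-duplicate of 'a'
    'b': rng.normal(size=200),                       # independent feature
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)  # ['a_copy']
```

Alternatively, dimensionality reduction such as PCA replaces the correlated features with uncorrelated components, at some cost in interpretability.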